Workflow Example
MaxQuant
For the analysis we require the proteinGroups.txt generated by MaxQuant and a corresponding metadata file where the sample names match with the sample names in the proteinGroups.txt file. At first take a look at the files:
[1]:
import pandas as pd
import warnings
from plotly.offline import init_notebook_mode, iplot
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl") # remove warning from mac
The proteinGroups.txt contains all standard column headers from MaxQuant. Later, for our analysis we will use the Protein Intensity described in "LFQ intensity [sample]".
[3]:
protein_groups = pd.read_csv("../../testfiles/maxquant/proteinGroups.txt", sep = "\t", low_memory=False)
protein_groups.head(5)
[3]:
| Protein IDs | Majority protein IDs | Peptide counts (all) | Peptide counts (razor+unique) | Peptide counts (unique) | Protein names | Gene names | Fasta headers | Number of proteins | Peptides | ... | Potential contaminant | id | Peptide IDs | Peptide is razor | Mod. peptide IDs | Evidence IDs | MS/MS IDs | Best MS/MS | Oxidation (M) site IDs | Oxidation (M) site positions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW... | P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW... | 4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;... | 4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;... | 0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;... | HLA class II histocompatibility antigen, DRB1-... | HLA-DRB1;HLA-DR15;HLA-DRB1*;HLA-DRB1*1327;MHC ... | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;... | 1834.0 | 4.0 | ... | NaN | 0.0 | 1287;1288;3971;15222 | True;True;True;True | 1415;1416;4387;16993 | 117340;117341;117342;117343;117344;117345;1173... | 56384;56385;56386;159948;159949;159950;602426;... | 56384;56385;159949;602427 | NaN | NaN |
| 1 | P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9 | P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9 | 10;10;9;8;5 | 10;10;9;8;5 | 10;10;9;8;5 | Plasminogen activator inhibitor 1 | SERPINE1 | ;;;; | 5.0 | 10.0 | ... | NaN | 1.0 | 1592;3771;4396;4628;6470;7404;9193;11222;13191... | True;True;True;True;True;True;True;True;True;True | 1746;4166;4871;5128;7155;8188;10148;10149;1257... | 136637;136638;136639;136640;136641;318672;3186... | 64917;149749;149750;178775;178776;184126;25946... | 64917;149749;178775;184126;259465;290355;36587... | NaN | NaN |
| 2 | P55083;A0A024QZ34;K7ES70 | P55083;A0A024QZ34;K7ES70 | 2;2;2 | 2;2;2 | 2;2;2 | Microfibril-associated glycoprotein 4 | MFAP4 | ;; | 3.0 | 2.0 | ... | NaN | 2.0 | 140;15067 | True;True | 155;156;16823 | 13174;13175;13176;13177;13178;13179;13180;1318... | 7599;7600;7601;7602;7603;7604;7605;7606;598026... | 7604;598026 | 0 | 117 |
| 3 | P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;... | P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;... | 16;16;15;15;13;12;10;10;9;8;7;7;7;5 | 13;13;13;12;10;10;9;9;8;5;7;7;6;4 | 2;2;2;2;0;2;2;2;2;0;2;2;0;0 | Fructose-bisphosphate aldolase C;Fructose-bisp... | ALDOC | ;;;;;;;;; | 14.0 | 16.0 | ... | NaN | 3.0 | 312;749;1545;1675;2085;3265;5046;5708;8847;884... | True;True;True;False;True;True;True;True;True;... | 346;827;1698;1838;2304;3606;5586;5587;6314;976... | 29006;29007;29008;69091;69092;69093;69094;6909... | 15520;15521;34013;34014;34015;34016;34017;6325... | 15520;34015;63252;67942;84108;131214;198759;21... | 1;2 | 40;251 |
| 4 | Q96C19;H0Y4Y4;A0A024QZ77 | Q96C19;H0Y4Y4;A0A024QZ77 | 1;1;1 | 1;1;1 | 1;1;1 | EF-hand domain-containing protein D2 | EFHD2 | ;; | 3.0 | 1.0 | ... | NaN | 4.0 | 8772 | True | 9685 | 734937;734938;734939;734940;734941;734942;734943 | 342134;342135 | 342134 | NaN | NaN |
5 rows × 2530 columns
Further, we created a excel-file with the corresponding metadata to our proteinGroups.txt-file. The sample names in the column “sample” match the names in proteinGroups.txt file.
[4]:
metadata = pd.read_excel("../../testfiles/maxquant/metadata.xlsx")
metadata.head(5)
[4]:
| subject external_id | biological_sample external_id | sample | tissue id | disease id | intervention id | tissue | disease | biological_sample quantity | biological_sample quantity_units | ... | Alanine aminotransferase measurement (34608000) | Aspartate aminotransferase measurement (45896001) | Alkaline phosphatase measurement (88810008) | Gamma glutamyl transferase measurement (69480007) | Hemoglobin A1c measurement (43396009) | Total cholesterol:HDL ratio measurement (166842003) | High density lipoprotein measurement (17888004) | Low density lipoprotein cholesterol measurement (113079009) | VLDL cholesterol measurement (104585005) | Triglycerides measurement (14740000) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 31 | 1_31_C6 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 24.0 | 30 | 54 | 21.0 | 6.3 | 3.6 | 1.26 | 2.1 | 0.3 | 0.58 |
| 1 | 32 | 32 | 1_32_C7 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 27.0 | 28 | 27 | 38.0 | 5.8 | 6.6 | 1.70 | 4.3 | 0.6 | 1.24 |
| 2 | 33 | 33 | 1_33_C8 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 18.0 | 21 | 69 | 18.0 | 6.2 | 5.7 | 1.12 | 4.1 | 0.5 | 1.12 |
| 3 | 34 | 34 | 1_34_C9 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 22.0 | 26 | 101 | 20.0 | 6.2 | 6.7 | 0.91 | 4.8 | 1.0 | 2.20 |
| 4 | 35 | 35 | 1_35_C10 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 18.0 | 25 | 61 | 13.0 | 5.4 | 5.5 | 1.21 | 3.9 | 0.4 | 0.90 |
5 rows × 46 columns
0. Import AlphaStats
Import library alphastats
[2]:
import alphastats
1. Import Data
Load the MaxQuant proteinGroups.txt file and specify the columns indicating the intensity as well as the column that is used for indexing, like here the “Protein IDs” or the gene names. As the column is used for indexing, the values of this column must be unqiue.
[3]:
maxquant_data = alphastats.MaxQuantLoader(file="../../testfiles/maxquant/proteinGroups.txt",
intensity_column="LFQ intensity [sample]",
index_column="Protein IDs")
2. Create a DataSet
Combine the imported MaxQuant data with the metadata
[4]:
ds = alphastats.DataSet(
loader = maxquant_data,
metadata_path = "../../testfiles/maxquant/metadata.xlsx",
sample_column = "sample" # specify the column that corresponds to the sample names in proteinGroups
)
AlphaStats will create a matrix of the Protein Intensities, which will be accessable using ds.mat and will save the metadata as a dataframe ds.metadata. Our original MaxQuant ProteinGroup file contains much more samples, than we have metadata for
## 3. Preprocess
[8]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")
Number of samples in the matrix: 312, number of samples in metadata: 48.
Firstly, we will subset the matrix it will only contains samples, that are also described in the metadata.
[5]:
ds.preprocess(subset=True)
[10]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")
Number of samples in the matrix: 48, number of samples in metadata: 48.
Unnormalized data, Sample Distribution
[6]:
sample_distribution_plot = ds.plot_sampledistribution(color = "disease")
iplot(sample_distribution_plot)
Contaminations get removed indicated in following columns Only identified by site, Reverse, Potential contaminant (MaxQuant specific) and contamination_library (added by AlphaStats)
Normalized using quantile normalization
Missing Values get imputed using K-nearest neighbour imputation
[8]:
ds.preprocess(
remove_contaminations=True,
normalization = "quantile",
imputation = "knn"
)
/Users/drq441/opt/anaconda3/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:2590: UserWarning:
n_quantiles (1000) is greater than the total number of samples (48). n_quantiles is set to n_samples.
After quantile normalization, Sample Distribution
[9]:
sample_distribution_plot_2 = ds.plot_sampledistribution(method = "box", color = "disease")
iplot(sample_distribution_plot_2)
The preprocessing steps can be accessed using:
[14]:
ds.preprocess_print_info()
Preprocessing:
The raw data contains 2611 Proteins/ProteinGroups.
The filtered data contains 2047 Proteins/ProteinGroups.Data has been normalized using quantile normalization.
Missing values were imputed using the k-Nearest Neighbor.
Contaminations indicated in following columns: ['Only identified by site', 'Reverse', 'Potential contaminant', 'contamination_library'] were removed. In total 202 observations have been removed.
4. Visualization
Principal Component Analysis (PCA)
[10]:
pca_plot = ds.plot_pca(group = "disease", circle = True)
iplot(pca_plot)
[ ]: